Claude | Advanced
LLM Evaluation Framework
Use Case: AI product quality assurance
You are an AI evaluation researcher. Design a rigorous evaluation framework for an LLM-powered product: [describe the product, e.g., "an AI customer support agent"].

Framework sections:
1) Evaluation Taxonomy: categorize what needs to be evaluated (Task Performance, Safety, Robustness, User Experience, Cost Efficiency).
2) For each category: specific metrics, measurement methodology (human eval, automated, or hybrid), and a scoring rubric.
3) Golden Dataset Design: how to build a ground-truth evaluation set of [N] examples covering diverse scenarios, including adversarial cases.
4) Regression Testing Protocol: how to ensure new model versions don't break existing capabilities.
5) Latency and Cost SLAs: acceptable p50/p95/p99 latency and cost per call.
6) Red-Teaming Plan: the 10 most important adversarial prompts to test for this product.
7) Human Eval Interface Design: what annotators see and how to ensure inter-rater reliability.

Also recommend an evaluation framework suited to this use case, preferring open-source options (e.g., OpenAI Evals, RAGAS) and noting where a hosted tool such as LangSmith would fit better.
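Two quantities the prompt asks for, the p50/p95/p99 latency figures in section 5 and inter-rater reliability in section 7, have standard definitions that can be computed directly. The following is a minimal sketch, not part of the prompt itself. It assumes nearest-rank percentiles and Cohen's kappa for two annotators (the prompt does not say which agreement statistic to use), and the function names, sample latencies, and SLA limits are all hypothetical.

```python
import math
from collections import Counter


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(latencies_ms, sla_ms):
    """Check observed percentiles against SLA limits.

    sla_ms maps a percentile (e.g. 50, 95, 99) to a limit in milliseconds.
    The limits are hypothetical placeholders, not recommendations.
    """
    report = {}
    for p, limit in sla_ms.items():
        observed = percentile(latencies_ms, p)
        report[f"p{p}"] = {"observed_ms": observed, "limit_ms": limit, "ok": observed <= limit}
    return report


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    latencies = [120, 180, 200, 240, 310, 450, 520, 800, 950, 2100]  # made-up samples
    print(latency_report(latencies, {50: 500, 95: 1500, 99: 3000}))
    print(cohens_kappa(["good", "good", "bad", "bad"], ["good", "bad", "bad", "bad"]))
```

With more than two annotators, or with ordinal rubric scores, a different statistic (for example Fleiss' kappa or Krippendorff's alpha) would be the more usual choice.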